Conversation

joecummings (Member) commented Oct 8, 2025

Context

Are we correctly clearing our prefix cache? Hard to know without proper testing. Hence, this PR.

What changed?

  • Modified the policy to 1) just call clear prefix cache on the scheduler and 2) attach the number of cached tokens to the metadata
  • Added "metadata" to Completions (rough sketch below)
  • Added a test that checks that the number of cached tokens is as expected
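
As a rough sketch of the metadata plumbing (illustrative names, not the exact code merged in this PR), the idea is simply to carry vLLM's per-request prefix-cache hit count through to the returned completion:

```python
# Illustrative sketch only; the Completion class and field names are assumptions,
# not the ones added in this PR. vLLM already reports per-request prefix-cache
# hits via RequestOutput.num_cached_tokens, so the policy just forwards it.
from dataclasses import dataclass, field
from typing import Any

from vllm.outputs import RequestOutput


@dataclass
class Completion:
    text: str
    metadata: dict[str, Any] = field(default_factory=dict)


def to_completion(out: RequestOutput) -> Completion:
    return Completion(
        text=out.outputs[0].text,
        # Number of prompt tokens served from the prefix cache for this request.
        metadata={"num_cached_tokens": out.num_cached_tokens},
    )
```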

Open questions

  • Do you find this test to be sufficient proof that we are using the prefix cache from vLLM "correctly"?

meta-cla bot added the "CLA Signed" label Oct 8, 2025
joecummings changed the title from "Test vLLM prefix caching correctness" to "vLLM prefix caching correctness (+ "test")" Oct 8, 2025
joecummings marked this pull request as ready for review October 8, 2025 18:32
According to the vLLM docs (https://docs.vllm.ai/en/v0.9.0/api/vllm/outputs.html#vllm.outputs.RequestOutput),
this is the number of tokens with a prefix cache hit. So, the logic is that if we run one generation,
then run another generation with the same start, we should see the number of cached tokens == the length of the prefix.
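
Concretely, the check boils down to something like this minimal sketch against the offline LLM API (the actual test drives the async engine directly; the model and prompts here are placeholders):

```python
# Minimal sketch of the assertion using the offline API. Prefix-cache hits are
# reported per request on RequestOutput.num_cached_tokens.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m", enable_prefix_caching=True)
params = SamplingParams(max_tokens=8)

prefix = "The quick brown fox jumps over the lazy dog. " * 4

first_out = llm.generate([prefix + "Summarize this."], params)[0]
assert not first_out.num_cached_tokens  # cold cache: no hits on the first request

second_out = llm.generate([prefix + "Now translate it."], params)[0]
# Warm cache: the shared prefix is served from the cache. Hits are counted in
# KV-cache block multiples, so an exact comparison should use the block-aligned
# prefix length rather than the raw token count.
assert second_out.num_cached_tokens > 0
```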
Contributor

TIL

logger.info(f"Weight update completed (now v{self.policy_version})")

@endpoint
async def _reset_prefix_cache(self):
Contributor

Maybe we can have this convention of

Suggested change
async def _reset_prefix_cache(self):
async def _test_reset_prefix_cache(self):

Member Author

In this case, I actually don't mind exposing to the end user! It could be used if someone wants to do something custom with their Policy setup.
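
As an illustration of that kind of custom use (not part of this PR, and the exact entrypoint method may differ across vLLM versions): after a weight update, cached prefix blocks refer to the old weights, so a caller would want to drop them before the next batch.

```python
# Illustration only -- this is the analogous knob on the plain vLLM entrypoint,
# not the Policy endpoint from this PR. Model name is a placeholder; in older
# vLLM versions the method lives on the underlying engine instead.
from vllm import LLM

llm = LLM(model="facebook/opt-125m", enable_prefix_caching=True)

# ... push new weights to the workers ...

llm.reset_prefix_cache()  # invalidate stale prefix blocks before the next batch
```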

Jack-Khuu (Contributor) left a comment

Thanks for the test

logger.info(f"Weight update completed (now v{self.policy_version})")

@endpoint
async def _reset_prefix_cache(self):
Contributor

Another example of needing
meta-pytorch/monarch#1455

)
expected_cached_tokens = 0
async for res in vllm_model.generate(
first_prompt, sampling_params, request_id="first_16"
Contributor

Does the request_id matter here?

Member Author

Nah, just for logging.

joecummings merged commit 3dd4853 into meta-pytorch:main Oct 8, 2025
8 checks passed
joecummings deleted the kv-caching-test branch October 8, 2025 21:45